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Abstract — We propose a new information theoretic metric 
for finding periodicities in stellar light curves. Light curves 
are astronomical time series of brightness over time, and are 
characterized as being noisy and unevenly sampled. The proposed 
metric combines correntropy (generalized correlation) with a 
periodic kernel to measure similarity among samples separated 
by a given period. The new metric provides a periodogram, called 
Correntropy Kernelized Periodogram (CKP), whose peaks are 
associated with the fundamental frequencies present in the data. 
The CKP does not require any resampling, slotting or folding 
scheme as it is computed directly from the available samples. 
CKP is the main part of a fully-automated pipeline for periodic 
light curve discrimination to be used in astronomical survey 
databases. We show that the CKP method outperformed the 
slotted correntropy, and conventional methods used in astronomy 
for periodicity discrimination and period estimation tasks, using a 
set of light curves drawn from the MACHO survey. The proposed 
metric achieved 97.2% of true positives with 0% of false positives 
at the confidence level of 99% for the periodicity discrimination 
task; and 88% of hits with 11.6% of multiples and 0.4% of misses 
in the period estimation task. 

Index Terms — Correntropy, information theory, time series 
analysis, period detection, period estimation, variable stars 

I. Introduction 

A light curve represents the brightness of a celestial object 
as a function of time (usually the magnitude of the star 
in the visible part of the electromagnetic radiation). Light 
curve analysis is an important tool in astrophysics used for 
estimation of stellar masses and distances to astronomical 
objects. By analyzing the light curves derived from the sky 
surveys, astronomers can perform tasks such as transient event 
detection, variable star detection and classification. 

There are a certain types of variable stars [ 1 1 whose bright- 
ness varies following regular cycles. Examples of this kind 
of stars are the pulsating variables and eclipsing binary stars. 
Pulsating stars, such as Cepheids and RR Lyrae, expand and 
contract periodically effectively changing their size, temper- 
ature and brightness. Eclipsing binaries, are systems of two 
stars with a common center of mass whose orbital plane is 
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aligned to Earth. Periodic drops in brightness are observed 
due to the mutual eclipses between the components of the 
system. Although most stars have at least some variation in 
luminosity, current ground based survey estimations indicate 
that 3% of the stars varying more than the sensitivity of the 
instruments and ~1% are periodic 0. 

Detecting periodicity and estimating the period of stars is 
of high importance in astronomy. The period is a key feature 
for classifying variable stars J3], 01; and estimating other 
parameters such as mass and distance to Earth Q. Period 
finding in light curves is also used as a means to find extrasolar 
planets (6). Light curve analysis is a particularly challenging 
task. Astronomical time series are unevenly sampled due to 
constraints in the observation schedules, the day-night cycle, 
bad weather conditions, equipment positioning, calibration and 
maintenance. Light curves are also affected by several noise 
sources such as light contamination from other astronomical 
sources near the line of sight (sky background), the atmo- 
sphere, the instruments and particularly the CCD detectors, 
among others. Moreover, spurious periods of one day, one 
month and one year are usually present in the data due to 
changes in the atmosphere and moon brightness. Another chal- 
lenge of light curve analysis is related to the number and size 
of the databases being build by astronomical surveys. Each 
observation phase of astronomical surveys such as MACHO 
0, OGLE (11, SDSS @ and Pan-STARRS QOj have captured 
tens of millions of light curves. Soon to arrive survey projects 
such as the LSST ifTTl will collect approximately 30 terabytes 
of data per night which translates into databases of 10 billion 
stars. 

Currently, most periodicity finding schemes used in astron- 
omy are interactive and/or rely somehow on human visual in- 
spection. This calls for automated and efficient computational 
methods capable of performing robust light curve analysis for 
large astronomical databases. 

In this paper we propose the Correntropy Kernelized Peri- 
odogram (CKP), a new metric for finding periodicities. The 
CKP combines the information theoretic learning (ITL) con- 
cept of correntropy |12| with a periodic kernel. The proposed 
metric yields a periodogram whose peaks are associated with 
the fundamental frequencies present in the data. A statistical 
criterion, based on the CKP, is used for periodicity discrim- 
ination. We demonstrate the advantages of using the CKP 
by comparing it with conventional spectrum estimators and 
related methods in databases of astronomical light curves. 
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II. Related methods 

Several methods have been developed to cope with the 
characteristics of light curves. The most widely used are the 
Lomb-Scargle (LS) periodogram fl3l . Ifl4"l . epoch folding, 
analysis of variance (AoV) ifTSI , string length (SL) methods 
[16 1, [17], and the discrete or slotted autocorrelation |18|, 
lfl9l . For the LS and AoV periodograms statistical confidence 
measures have been developed to assess periodicity detection 
besides estimating the period. 

The LS periodogram is an extension of the conventional 
periodogram for unevenly sampled time series. A sample 
power spectrum is obtained by fitting a trigonometric model in 
a least squares sense over the available randomly sampled data 
points. The maximum of the LS power spectrum corresponds 
to the angular frequency whose model best fits the time series. 
In epoch folding a trial period P t is used to obtain a phase 
diagram of the light curve by applying the modulus (mod) 
transformation of the time axis: 

, , . U mod P t 

MPt) = p , 

where ti are the time instants of the light curve. The trial pe- 
riod Pt is found by and ad-hoc method, or simply corresponds 
to a sweep among a range of values. This transformation is 
equivalent to dividing the light curve in segments of length P t 
and then plotting the segments one on top of another, hence 
folding the light curve. If the true period is used to fold the 
light curve, the periodic shape will be clearly seen in the phase 
diagram. If a wrong period is used instead, the phase diagram 
won't show a clear structure and it will look like noise. 

In AoV |[l"5l the folded light curve is binned and the ratio 
of the within-bins variance and the between-bins variance is 
computed. If the light curve is folded using its true period, 
the AoV statistic is expected to reach a minimum value. In 
SL methods, the light curve is folded using a trial period and 
the sum of distances between consecutive points (string) in 
the folded curve is computed. The true period is estimated 
by minimizing the string length on a range of trial periods. 
The true period is expected to yield the most ordered folded 
curve and hence the minimum total distance between points. 
In slotted autocorrelation lfT8l . lfT9l . time lags are defined 
as intervals or slots instead of single values. The slotted 
autocorrelation function at a certain time lag slot is computed 
by averaging the cross product between samples whose time 
differences fall in the given slot. 

All the related methods described above are based on 
second-order statistic analysis. Information theoretic based cri- 
teria extract information from the probability density function, 
therefore it includes higher-order statistical moments present 
in the data. This usually implies a better modelling of the 
underlying process and robustness to noise and outliers. 

The slotting technique was extended to the information the- 
oretic concept of correntropy in GUI . The slotted correntropy 
estimator was compared with the other mentioned techniques 
on period estimation of light curves from the MACHO survey, 
performing equally well on Cepheid/RR Lyrae and much better 
in eclipsing binaries period estimation. However, the slotted 



technique has the drawback that is highly dependant on the 
slot size. 

III. Background on ITL methods and periodic 

KERNELS 

A. Generalized correlation function: Correntropy 

In |[T2ll . 1211 an information theoretic functional capable of 
measuring the statistical magnitude distribution and the time 
structure of random processes was introduced. The generalized 
correlation function (GCF) or correntropy measures similari- 
ties between feature vectors separated by a certain time delay 
in input space. The similarities are measured in terms of inner 
products in a high-dimensional kernel space. For a random 
process {X t , t G T} with T being an index set, the correntropy 
function is defined as 

V(t 1 ,t 2 ) = E Xtl x t2 [n(x tl ,xt 2 )}, (1) 
and the centered correntropy is defined as 

Uih,^) =E XtiXt2 [K(x tl ,x t2 )} 
-E Xti E Xt2 [n(x tl ,x t2 )}, 

where E[-] denotes the expectation value and «;(■, •) is any 
positive definite kernel [12]. A kernel can be viewed as a 
similarity measure for the data [22|. The Gaussian kernel 
which is translation-invariant, is defined as follows: 

where a is the kernel size or bandwidth. The kernel size can be 
interpreted as the resolution in which the correntropy function 
search for similarities in the high-dimensional kernel feature 
space |fl~2), ll2D . The kernel size gives the user the ability 
to control the emphasis given to the higher-order moments 
with respect to second-order moments. For large values of the 
kernel size, the second-order moments have more relevance 
and the correntropy function approximates the conventional 
correlation. On the other hand if the kernel size is set too 
small, the correntropy function will not be able to discriminate 
between signal and noise and approximates the Dirac delta 
function. 

The name correntropy was coined due to its close relation 
to Renyi's quadratic entropy, which can be estimated through 
Parzen windows CPU as follows: 

H m {X) = -log(/P CT (X)), (4) 

where 

N N 

i=i j=i 

and N is the number of samples of the random variable X and 
a is the kernel size of the Gaussian kernel function. Equation 
Q is the argument of the logarithm in Eq. Q and is called the 
Information Potential (IP). The mean value of the correntropy 
function over the lags is a biased estimator of the IP lfl2l . 
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For a discrete strictly stationary random process {X n }, 
the univariate correntropy function or autocorrentropy can be 
defined as 

V[m] = E[n(x n ,x n - m )], 
which can be estimated through the sample mean 

1 N 

v A m ] = Tt rr ^ G a {x n - x n - m ). (6) 

N — m + 1 A — ' 

n—m 

Likewise, the estimator of the univariate centered correntropy 
function (Eq. [2]i is 



U a [m] 



N- 



1 N 

, 7 / , G a (x Tl x n _ m ) 
m + 1 ^— ' 



1 



N N 

y ' y ' G a (x n — x m ), (7) 

n— 1 rn — 1 

where the Gaussian kernel with kernel size a is used, TV is the 
number of samples of {X n } and m £ [1, iV] is the discrete 
time lag. In practice, the maximum lag should be chosen so 
that there are enough samples to estimate correntropy at each 
lag. Notice that the second term in Eq. |7]i corresponds to the 
IP (Eq. [3]), which is the mean of the autocorrentropy function 
over the lags. 

The Fourier transform of the centered autocorrentropy func- 
tion is called correntropy spectral density (CSD) and is defined 
as IE), ED, E3]: 



PAf] = 



oo 

E 

m— — o 



U a [m] ■ exp -j2irf 



(8) 



where U a [m] is the univariate centered correntropy function, 
and F s is the sampling frequency. The CSD can be consid- 
ered as a generalized power spectral density (PSD) function, 
although it is a function of the kernel size and it does not 
measure power. As with correntropy, the kernel size controls 
the influence of the higher-order moments versus the second- 
order statistical descriptors. Particularly, for large values of the 
kernel size, the CSD approximates the conventional PSD. 

In ll23ll the correntropy function and the CSD were used to 
solve the problem of detecting the fundamental frequency in 
speech signals. Correntropy outperformed conventional corre- 
lation, showing better discriminatory and robustness to noise. 
In 11241 correntropy was used to solve the blind source sepa- 
ration (BSS) problem, successfully separating signals coming 
from independent and identically distributed sources and also 
Gaussian sources. Correntropy outperformed methods that 
also make use of higher-order statistics such as Independent 
Component Analysis (ICA). In (25], correntropy was used 
as a discriminatory metric for the detection of nonlinearities 
in time series, outperforming traditional methods such as the 
Lyapunov exponents. 

B. Periodic kernel functions 

Kernels can also be viewed as covariance functions for 
correlated observations at different points of the input domain 



1 22]. In our research we are interested in measuring similarities 
among samples separated by a given period. A kernel function 
is periodic with period P if it repeats itself for input vectors 
separated by P. Periodic kernel functions are appropriate 
for nonparametric estimation, modelling and regression of 
periodic time series [26 1. Periodic kernel functions have also 
been proposed in the Gaussian processes literature [27], [28|, 

& 

A periodic kernel function can be obtained by applying 
a nonlinear mapping (or warping) u(z) to the input vector 
z. In ll28l a periodic kernel function was constructed by 
mapping a unidimensional input variable z using a periodic 
two-dimensional warping function defined as 



Up (z) 



2tt 
cos [ — z 



, sin 



2tt 
~P' 



(9) 



The periodic kernel function Gp ;(J (z — y) with period P, is 
obtained by applying the warping function (Eq. |9| to the inputs 
of the Gaussian kernel function (Eq. [3j. The periodic kernel 
function is defined as, 



G a - P (z -y) = G a {u P (z) - u P (y)) 



1 



exp 



K*-t/))' 



(10) 



V2tt<t 1 V, °- 5cr2 
where the following expression is used 



|| up (z) - up (y) || =4 sin 2 



k{z - y) 
P 



The periodic kernel function ( fT0] > is related to the von Mises 
probability density function [30|. 

IV. Correntropy kernelized periodogram 

In this paper we propose an ITL based method for finding 
periodicities in unevenly sampled time series. The proposed 
method does not require any resampling, slotting or folding 
scheme, as it is computed directly from the available samples 
and detects periodicity using the actual magnitudes and time 
instants of the samples. The new metric combines the cen- 
tered correntropy function and the periodic kernel function. 
For a discrete unidimensional random process {X n } with 
n = 1, ...,N, kernel sizes cr t and er m , and a trial period 
Pt, the proposed metric is computed as 



N N 



V P{ ^ m} (Pt) = ^ EE (^(A^) - IP) 



i=l j=l 
■ G^p^At^-wiAtij), 



(11) 



where Air,,- = 



-tj, G am (-) is the Gaussian 



kernel function (Eq. B), IP is the information potential (Eq. [5} 



Ga t -,p t ( ) is me periodic kernel function (Eq. 10 1 and w(Atij) 



is a Hamming window. 

In Eq. ( fTTj ) magnitude differences are evaluated using the 
Gaussian kernel function, while time differences are evaluated 
using the periodic kernel function. The new metric performs 
the pointwise multiplication between the centered Gaussian 
kernel coefficient (G am (Axij) — IP) and the periodic kernel 
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coefficients G crt - > p t (Atij). Notice that Eq. ( fTT| is a function 
of the trial period P t and it has two free parameters: the 
magnitude kernel size a m and the time kernel size a t . By 
analogy with the conventional periodogram (square of the 
discrete Fourier transform of the data) we call Eq. ( fTTj i the 
Correntropy Kernelized Periodogram (CKP). Another exten- 
sion of the conventional theory is the wavelet periodogram 
J3T). A Hamming window defined as 

w(AUj) = 0.54 + 0.46 • cos (^^^j , (12) 

where T is the total time span of the light curve, is used in Eq. 
(JTTJ to have a smoother estimation of the periodogram 11321 . 

The periodic kernel gives larger weights to the samples 
whose time difference are multiples of the the trial period. 
In other words, the periodic kernel emphasizes the magnitude 
differences that are separated by the trial period and its 
multiples. Intuitively, for a periodic time series, the magni- 
tude differences between samples separated by the underlying 
period in time are expected to be minimum (most similar). If 
the trial period of the periodic kernel is set to the true period, 
the metric is expected to reach a maximum value. By using 
the maximum value of Eq. (jTTJ over a set of trial periods, the 
best estimation of the underlying period is obtained. 

In order to compare periodograms from different time series, 
the CKP has to be invariant to the data scale. In fPH a 
scale-invariant criterion based on Renyi's quadratic entropy 
is proposed. The condition for Gaussian variables is that the 
magnitude's kernel size should be directly proportional to 
the spread of the data. Herein, we use this condition as an 
approximate method to make the representation scale invariant. 
In Eq. ( fTT| we set a m as follows, 

<r m {X n ) = k • min {0.7413 iqr(X„), std(X„)} , (13) 

where k is a constant, iqr is the interquartile rang^j] and 
std is the standard deviation |[T2l . The iqr is a measure of 
statistical dispersion, being equal to the difference between 
the third and first quartiles of the data. The first and third 
quartiles are the medians of the first half and second half of 
the rank-ordered data, respectively. For a normal distribution 
the standard deviation is equal to 0.7413 iqr. The selection of 
constant k is discussed in Section IVI-AI 

Fig. 1 shows an example using a synthetic time series to 



illustrate the effect of the proposed metric. Fig. la shows 
a synthetic time series t/, = sin(27rt i /P) + 0.8 • e$, with 
l (i + 0.5 • Ei), where ej and Ej are normally 



U = 



N 



distributed random variables with zero mean and unit standard 
deviation. The noise in time simulates uneven sampling. In 
this example N = 400, T max — 25, and the underlying 
period is P = 2.456 seconds. Fig. [Tb] shows the kernel 
coefficients (G (Tm (Axij(Atij)) — IP) and G at -p t (Atij) as 
a function of the time differences collected from the time 
series. The magnitude kernel size is set using Eq. ( fl~3| i with 
k = 0.3. The time kernel size and the trial period are set to 



a t = 0.1, P t = 2.456, respectively. Fig. lc shows the CKP 




5 10 15 

Time[s] 

(a) 



20 25 




12 3 4 

Time differences [s] 

(b) 




'If the data does not follow a normal distribution or it contains outliers, the 
interquartile range will provide a better spread estimation than the standard 
deviation because it uses the middle 50% of the data 



0.4 0.6 
Frequency [Hz] 

(c) 

Fig. 1. (a) Synthetic time series sin(2izt/P) with period P = 
2.456 seconds plus Gaussian noise. The time instants have been ran- 
domly perturbed to simulate uneven sampling, (b) Kernel coefficients, 
(G CTm (Axij (Atij)) — IP) and G at -p t (Aty), as a function of the time 
differences. The CKP is the pointwise product between the centered Gaussian 
kernel coefficients and the periodic kernel coefficients, (c) CKP as a function 
of the trial frequency, the dotted line marks the location of the true period. 
The global maximum of the CKP corresponds to the underlying period. 

for a range of periods, the location of the underlying period 
in the periodogram is marked with a dotted line. The CKP 
reaches a global maximum at the corresponding underlying 
period P = 2.456. 



A. A statistical test based on the CKP for periodicity discrim- 
ination 

For a periodic time series with an oscillation frequency /, 
its periodogram will exhibit a peak at that frequency with high 
probability. But the inverse is not necessarily true, a peak 
in the periodogram does not imply that the time series is 
periodic. Spurious peaks may be produced by measurement 
errors, random fluctuations, aliasing or noise. 
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In this section, a statistical test for periodicity is introduced, 
using the global maximum of the CKP as test statistic. The null 
hypothesis is that there are no significant periodic components 
in the time series. The alternative hypothesis is that the CKP 
maximum corresponds to a true periodicity. The distribution 
of the maximum value of the CKP is obtained through Monte- 
Carlo simulations. Surrogate time series 11331 . 11341 are used to 
test the null hypothesis. The surrogate generation algorithm 
has to be consistent with the null hypothesis. To achieve this, 
the block bootstrap method 11351 . which breaks periodicities 
preserving the noise characteristic and time correlations of 
the light curve, is used. The procedure used to construct an 
unevenly sampled surrogate using the block bootstrap method 
is as follows 

1) Obtain a data block from the light curve by randomly 
selecting a block of length L and a starting point j e 
[1,N-L]. 

2) Subtract the first time instant of the block, so that it 
starts at days. 

3) Add the value of the last time instant of the previous 
block to the time instants of the current block. 

4) Parse the current block to the surrogate time series. 

5) Repeat steps 1-4 until the surrogate time series have the 
same amount of samples of the original light curve. 

For a given significance level a and kernel sizes at and a m , 
the null hypothesis is rejected if 



maxVp {o 



f }(/) > V p{cr m ,<r t }> 



where for N light curves V3 a CTf j is pre-computed as follows: 

1) Generate M surrogates from each light curve using 
block bootstrap. 

2) Save the maximum CKP ordinate value of each surro- 
gate. 

3) Find P a such that a (1 — ct)% of the ordinate values 
saved from the surrogates are below this threshold (one- 
tailed distribution). 

4) Compute V£t a at ^ as the mean P a and its correspond- 
ing error bars as the standard deviation of P a for the N 
light curves (N ■ M surrogates). 

V. Period Detection Method 
A. Description of the MACHO database 

The MACHO project Q was designed to search for gravi- 
tational microlensing events in the Magellanic Clouds and the 
galactic bulge. The project started in 1992 and concluded in 
1999. More than 20 million stars were surveyed. The MACHO 
project has been an important source for finding variable stars. 
The complete light curve database is available through the 
MACHO project's websit^] There are two light curves per 
stellar object: channels blue and red. Only the blue channel 
light curves are used here. Each light-curve has approximately 
1000 samples and contains 3 data columns: time, magnitude 
and an error estimation for the magnitude. 

Astronomers from the Harvard Time Series Center (TSC) 
have a catalog of variable stars from the MACHO survey. 

2 http://wwwmacho. anu.edu.au/ 



The underlying periods of the periodic variable stars were 
estimated using epoch folding, AoV, and visual inspection. 
In this paper, we consider the TSC periods to be the gold 
standard. 

A subset of 1500 periodic light curves (500 Cepheids, 500 
RR Lyrae and 500 eclipsing binaries) and 3500 non-periodic 
light curves was drawn from the MACHO survey. The subset 
was divided into a training set for parameter adjustment and 
a testing set. The training set consisted of 2500 light curves 
(750 periodic and 1750 non periodic) randomly selected from 
the available classes. The remaining 2500 light curves were 
used for testing purposes. 

There is a natural imbalance between periodic and aperiodic 
classes of stars. Only 3% of the surveyed stars are expected 
to be variables and ~1% to be periodic. Due to this, when 
detecting periodic behaviour, we have to achieve a false 
positive rate less than 0.1%. 

B. Description of the procedure for periodicity detection 

In what follows, the steps of the periodicity detection 
algorithm, for a given time kernel size a t and magnitude kernel 
size <r m , are described. 

1) Cleaning: The light curve's blue channel is imported. 
The mean e and the standard deviation a e of the 
photometric error are computed. Samples that do not 
comply with e, < e + 3 ■ a e , where e, is the photometric 
error of sample i, are discarded. 

2) Computing the periodogram The CKP (Eq. [TT 



3) 



4) 



5) 



computed on 20000 logarithmically spaced periods be- 
tween 0.4 days and 300 days. The periods associated 
to the ten highest local maxima at the periodogram are 
saved as trial periods for the next step. 
Fine-tuning of trial periods: The CKP is used to fine 
tune the ten trial periods. Each trial period is fine-tuned 
around a 0.5% of its value ([1.0025 • f triah 0.9975 • 
ftrial]), using a step size of df = in frequency, 
where T is the total time span of the light curve. 
Selection of the best trial period: The trial periods are 
sorted in descending order following its CKP ordinate 
value. The best trial period Pb es t is selected as the one 
with the highest value of Vp, that is not a multiple of a 
spurious period, as described below. 
Finally, if the best period comply with 
Vp{a m .a t }(Pbest) > & then the light curve is labeled 
as periodic, where 9 is the periodogram threshold for 
periodicity. The confidence associated to 9 is obtained 



using the procedure described in Section IV-A 



Table [I] gives a summary of the parameters of the proposed 
method. The kernel sizes and the periodicity threshold do not 
appear in Table [I] because they need to be calibrated using a 
procedure described in the following sections. 

To obtain the spurious periods the following spectral win- 
dow function is used 



W(f) 



1 

N 



N 



(14) 
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TABLE I 

Summary of parameters of the period detection pipeline 



Description 


Value 


Minimum period 


0.4 days 


Maximum period 


300 days 


Periodogram resolution 


20.000 periods 3 


Number of fine-tuned trial periods 


10 


Fine-tune resolution 


periods 4 



where ti with i = 1 , . . . , N are the time instants of the light 
curve. Eq. ( p~4] > is the periodogram of the sampling pattern 
of the light curve. The frequencies associated to the peaks 
of the spectral window are related to spurious periodicities 
caused by the sampling. In most cases, the spurious periods 
obtained from the spectral window are multiples of the sidereal 
day (0.99727 days) and yearly periodicities (365.25 days). 
The moon phase period (29.53 days) is also added to the list 
of spurious periods. The moon phase has no relation to the 
sampling but it is intrinsic to the data. The daily sampling 
produces alias peaks of the true periods (P) in the CKP at 
(1/P+k) 1/days, where k € Z. The CKP values of the aliases 
are very low and therefore they do not need to be filtered out. 

C. Computational time 

The algorithm^] was programmed in C/CUDA and imple- 
mented in a Graphical Processing Unit (GPU) NVIDIA Tesla 
C2050 with 448 cuda cores. The computational time required 
to process one light curve (1000 samples) is 1.812 seconds. 
The total time to process the 5000 light curves is 2.5 hours. 

D. Performance criteria 

In what follows we define performance measures for the 
problems of period detection and period estimation. The 
former consists of discriminating between periodic and non 
periodic light curves. The latter consists in estimating the true 
periods of periodic light curves. For the period estimation 
problem the classification is done by using the TSC periods 
as golden standard. An estimated period P est is classified as 
either a Hit, a Multiple or a Miss with respect to the reference 
period P re f according to the following criteria: 

. Hit if \P ref - P est \ < e 

• Multiple if P est > P re f and 



P, 



Pref 

or if P est < P ref and 

Pref 



P,, 



P 



ref 



P. 



P 



ref 



P. 



< e, 



< s, 



where [^J is the largest integer less than or equal x. 

• Miss if it does not belong to any of the other categories. 

The tolerance value e controls the accepted relative error 
between the estimated period and the reference period. A value 
of e = 0.005, i.e. a relative error of 0.5% will be considered, 

3 Logarithmically spaced. 

4 Linearly spaced, T is the total time span of the light curve. 

5 The C-language implementation is available by request to the authors. 



small enough to obtain a clean folded curve from the estimated 
period. 

The problem of period detection in light curves can be 
treated as a binary classification problem. The classes are 
periodic light curves {+1} and non-periodic light curves 
{ — 1}. Confusion matrix and Receiver Operating Characteristic 
(ROC) curves are used to evaluate the period detection method. 
An ROC curve is a plot of the true positive rate (TPR) as a 
function of the false positive rate (FPR). Different points in 
the ROC curve are obtained by changing the threshold value 
at the output of the classifier. In this case, 

• true positive (TP) is a periodic light curve classified as 
periodic 

• false positive (FP) is a non-periodic light curve classified 
as periodic 

• true negative (TN) is a non-periodic light curve classified 
as non-periodic 

• false negative (FN) is a periodic light curve classified as 
non-periodic 

The TPR represents the proportion of periodic light curves 
that are correctly identified as such. The FPR represents the 
proportion of non-periodic light curves that are incorrectly 
classified as periodic. 

VI. Experiments 

A. Parameter calibration 

The CKP is a function of the kernel sizes a m and cr t . These 
parameters are adjusted using the 2500 light curves in the 
training set. The value of a m is set using Eq. ( fT3| , so we need 
to choose the value of the constant k. Fig. [2] shows the CKP 
as function of the frequency and at for light curve 1.3449.27 
from the MACHO catalog. In this example the underlying 
period is picked as the global maximum of the CKP for a time 
kernel size a t <G [0.075,0.125]. After extensive experiments 
we identified that this particular range of kernel sizes values 
gives the best results for the MACHO light curves. There 
is no clear rule for choosing the time kernel size, although 
intuitively, it should depend on the sampling pattern. 

The period estimation results for different combinations of 
k and <r t on the 750 periodic light curves of the training set 
are shown in Table [TT| The best performance is obtained with 
k = 1 and a t = 0.1. 

Fig. [3] compares the ROC curves obtained using different 
combination of both kernel sizes for the period detection 
problem. As mentioned before, due to the imbalance between 
periodic and non-periodic light curve classes, false positive 
rates below 0.1% are desired. Looking at the ROC curves it is 
clear that the best kernel size combination, in the area below 
1% FPR, is k = 1 and a t — 0.1. These values are fixed for 
the following experiments. 

B. Statistical significance 

Using the procedure described in Section |IV-A| statistical 
significance thresholds for the CKP were computed. Table 
III shows the significance thresholds and their corresponding 
CKP ordinate values for the best combination of kernel sizes 
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Fig. 2. CKP versus frequency and time kernel size at for light curve 
1.3449.27 from the MACHO catalog. The underlying period of 4.0348 days 
is found as the maximum of the CKP with a t g [0.075,0.125]. 

TABLE II 

Period estimation performance of the CKP using different 

COMBINATIONS OF KERNEL SIZES FOR THE TRAINING DATABASE 



k 


at 


Hits[%] 


Multiples[%] 


Misses[%] 


0.75 


0.05 


75.47 


23.47 


1.07 


0.75 


0.1 


86.13 


13.60 


0.27 


0.75 


0.15 


85.33 


13.87 


0.80 


1 


0.05 


82.27 


16.93 


0.80 


1 


0.1 


88.67 


11.20 


0.13 


1 


0.15 


86.93 


12.67 


0.40 


1.25 


0.05 


76.40 


22.13 


1.47 


1.25 


0.1 


85.73 


14.13 


0.13 


1.25 


0.15 


85.47 


14.00 


0.53 



(a t = 0.1 and k = 1). The thresholds were computed using 
the light curve training set (N = 2500) and five hundred surro- 
gates per light curve (M — 500). Fig. |4] shows the location of 
these thresholds in the ROC curve of the testing dataset. FPR 
rates below 1% are associated with confidence levels between 
95% and 99%. Fig. [5] shows three light curves in which the 
CKP ordinate value associated with the fundamental period has 
a confidence level higher than 99%. In the folded light curves 
(Fig. [5] right column) the periodic nature of the light curve 
can be clearly observed. In period detection schemes based 
on visual inspection these light curves would be undoubtedly 
labeled as periodic. Fig. [6] shows three light curves in which 
the CKP ordinate values associated with the fundamental 
period have a statistical confidence between 90% and 95%. 
These light curves are indeed periodic although compared to 
the previous three (shown in Fig. [5j, their periodicity is less 
clear as their signal to noise ratio is smaller. By associating a 
statistical level of confidence to the detected periods, we have 
obtained a way to set a period detection threshold and also a 
better interpretation of period quality. The level of confidence 
on the detected period could be used in later (post-processing) 
stages of the period detection pipeline. For example, periods 
with lower confidence levels may be selected for additional 
analysis stages in which finer resolution or different parameter 
combinations may be used. Fig. [7] shows examples of periods 
associated to the sidereal day (0.99727 days) and moon phase 
(29.53 days). These periods are discarded as spurious as 
described in Section [V-B I 




, k: 0.75, a t : 0.1 

, k: 0.75, a : 0.2 

k: 1,a :0.1 
■ t 

,k: 1,o : 0.2 

k: 1.25,0. : 0.1 
k: 1.25, a : 0.2 



0.5 1 1.5 2 

False Positive Rate [%] 

Fig. 3. Receiver operating characteristic curves for the period detection 
problem using different combinations of kernel sizes in the training database. 
TABLE III 

STATISTICAL SIGNIFICANCE THRESHOLDS OF THE CKP WITH at = 0.1 
AND k = 1. 

1 - a V<[_ 



0.99 3.59e-4 
0.95 3.12e-4 
0.90 2.80e - 4 



C. Influence of the periodic kernel 

In this experiment we demonstrate the advantages of the 
kernelized periodogram with respect to the conventional spec- 
trum estimator. The quadratic correntropy spectrum (QCS) is 
defined as follows: 



P. 



N N 



Xf) = l_^{G„ m {&x ij )-IP) 

i=l j=l 

•exp(-j27r/Aiy) • w(Atij), 



(15) 



where Axij = Xi—Xj, Atij — ti~tj, and IP is the information 



potential estimator (Eq. [5 i. Notice that Eq. ( fl"5j ) differs from 
the CSD definition (Eq. 8l) in two aspects. First the CSD 
definition uses integer lags, while the QCS uses directly the 
difference between samples. This change is useful to deal 
with unevenly sampled light curves. Secondly, we include 
in Eq. ( p"5| ) a Hamming window (Eq. [l2] i for enhancing the 
spectrum estimation. The constructed quadratic correntropy 
spectrum (QCS) is similar to the CKP (Eq. Hi except that 



the periodic kernel has been replaced by exp(— j2nfAt) to 
compute the Discrete Fourier Transform. First, we use an 
example to illustrate the differences between periodograms 
using a single light curve, from object 1.3449.27 of the 
MACHO survey. This light curve corresponds to an eclipsing 
binary with fundamental period 4.03488 days. Fig. [8] shows 
the CKP and QCS estimators for light curve 1.3449.27. The 
CKP was computed using k — 1 and at = 0.1. The same 
kernel size was used for the QCS. The dotted line marks the 
location of the underlying period. The global maximum of the 
CKP is associated with 4.0349 days, which corresponds to 
the underlying period of the light curve. Other local maxima 
appear at multiples and sub-multiples of the underlying period. 
On the other hand, the global maximum of the QCS is 
associated to 2.0175 days, the closest integer sub-multiple 
of the underlying period. The QCS value at the frequency 
corresponding to the true period is very small, in fact smaller 
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Fig. 4. ROC curve of the CKP with at = 0.1 and k = 1 in the test subset. 
The significance thresholds of the CKP are shown in the ROC curve. 
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Fig. 5. Examples of periodic light curves detected by the CKP method with 
a level of confidence greater than 99%. The original light curves are shown 
on the left column. The right column shows the same light curves folded with 
the estimated period. 

than many other spurious peaks. The procedure described in 
Section |V-B| is used to evaluate the performance of QCS 
estimator. Table IV shows the results obtained by the QCS and 





CKP estimators on the testing subset. The CKP obtains 12% 
more hits and 70% less misses than the QCS estimator, which 
shows clearly the advantages of the kernelized periodogram. 

D. Comparison with other methods 

The performance of the CKP method was compared with 
the slotted correntropy and other widely used techniques 
in astronomy. The software VarTools l36l . 071 was used 




1000 2000 
Time [days] 



Fig. 6. Examples of periodic light curves detected by the CKP method 
with a level of confidence between 90% and 95%. The original light curves 
are shown on the left column. The right column shows the same light curves 
folded with the estimated period. 

TABLE IV 

Period estimation performance of the CKP versus the QCS for 

THE TESTING DATABASE 



Method 


Hits[%] 


Multiples[%] 


Misses[%] 


CKP 


88.00 


11.60 


0.4 


QCS 


77.87 


20.80 


1.33 



to perform a Lomb-Scargle periodogram and Analysis of 
Variance analysis. The regularized Lafler-Kinman string length 
(SLLK) statistic and the slotted autocorrelation were also 
considered. For Vartools LS, the period associated with the 
highest peak of the LS periodogram, that is not a multiple 
of the known spurious periods (sidereal day, moon phase), 
gives the estimated period. A periodogram resolution of 0.1/T 
and a fine tune resolution of 0.01/T, where T is the total 
time span of the light curve, were used. For Vartools AoV 
and SLLK, the corresponding statistics are minimized in an 
array of periods ranging from 0.4 to 300 days with a step 
size of le-4. For AoV the default value of 8 bins is used. For 
AoV and SLLK, the period that minimizes the corresponding 
metrics, that is not a multiple of the known spurious periods, 
is selected as the estimated period. For the slotted autocor- 
relation/autocorrentropy the highest peak of the PSD/CSD 
estimator function, that is not associated to the known spurious 
periods, delivers the estimated period. For the slotted autocor- 
rentropy/autocorrelation a slot size At = 0.25 was considered 
1 20]. For the CKP method the best combination of kernel sizes 
(a t = 0.1 and k = 1) obtained with the training dataset is 
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Fig. 7. Examples of spurious periods found among the top peaks of the 
CKP. The original light curves are shown on the left column. The right column 
shows the folded light curves with the spurious period. These spurious periods 
are related to periodic daily changes in the atmosphere and periodic monthly 
changes in the brightness of the moon. These periods are discarded as spurious 
by our method. 
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Fig. 8. Comparison between the periodograms obtained using the proposed 
CKP metric and the QCS. The dotted line marks the true period. The 
underlying period corresponds to the global maximum of the CKP. On the 
contrary, the QCS value of the true period is very small. 



used. The influence of the higher-order statistical moments is 
assessed by comparing the CKP with a linear version of the 
proposed metric. In this linear version, the Gaussian kernel 
used to compare magnitude values is replaced by a linear 
kernel. The periodic kernel remains unchanged. The procedure 
to detect a period is the same as explained in Section |V-B| 
with an additional pre-processing step where the data vector 
is zero-mean normalized. The results for period estimation on 
the testing subset are shown in Table [V] The CKP method 
obtained the highest hit rate (88%), followed by the linear 
version of the CKP (80%), the slotted correntropy (78%) and 
the AoV periodogram (75%). The CKP obtained 8.6% more 
hits and 57% less misses than its linear version. This is because 
the Gaussian kernel incorporates all the even-order moments 



TABLE V 

Period estimation performance of the CKP method versus 

OTHER TECHNIQUES FOR THE TESTING DATABASE 



Method 


Hits[%] 


Multiples[%] 


Misses[%] 


CKP 


88.00 


11.60 


0.4 


CKP (linear kernel) 


80.40 


18.67 


0.93 


Slotted correntropy 


78.80 


19.73 


1.47 


Slotted correlation 


70.00 


28.80 


1.20 


VarTools LS 


61.73 


36.00 


2.27 


VarTools AoV 


75.33 


23.60 


1.07 


SLLK 


71.47 


26.27 


2.27 
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Fig. 9. Receiver operating characteristic curves of the CKP (k 
0.1), the LS periodogram and the AoV periodogram. 



1, at 



of the process and gives robustness to outlier data samples 
which are common in astronomical time series. The CKP 
obtained 10.5% more hits and 72.7% less misses than the 
slotted correntropy. This is because in the slotted correntropy, 
kernel coefficients are averaged on time slots, therefore the 
actual time differences between samples are not used. Out of 
the cases where the correct period is found by the CKP but 
not by the AoV periodogram, an 84% corresponds to eclipsing 
binaries. This is expected as conventional methods perform 
well on pulsating variables ||20l . Eclipsing binaries light curves 
are typically more difficult to analyze as their variations are 
non-sinusoidal and due to their morphology/shape characteris- 
tics most methods tend to return harmonics or sub-harmonics 
rather than the true period. The proposed method obtained 
the lowest miss rate (0.4%). In all these missed cases^] the 
true period was correctly found by the proposed metric but 
they were filtered out for being too near to the sidereal day. 
More accurate ways of filtering out spurious periods, using 
the data samples instead of a straightforward comparison of 
the detected periods, could be implemented to recover such 
missed cases. Fig.[9]shows ROC curves for the task of periodic 
versus non-periodic discrimination in the 2500 light curve 
testing subset. The CKP is compared with the LS and AoV 
periodograms. The proposed method clearly outperforms its 
competitors in the FPR range of interest (below 1%). It is 
worth noting that even if a harmonic of the true period is 
found, periodicity can still be detected. This is true for all the 
methods as periodicity detection rates are comparatively better 
than Hit rates obtained for period estimation. 

6 Light curves 1.4539.37, 3.6605.124 and 6.5726.1276, with periods 2.9955 
(three times sidereal day), 3.9813 (four times the sidereal day) and 0.99676, 
respectively. 
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VII. Conclusion 

We have proposed a new metric for periodicity finding based 
on information theoretic concepts. The CKP metric yields a 
kernelized periodogram. It has been shown that the proposed 
method has several advantages over the correntropy spectral 
density and other conventional methods of period detection 
and estimation. The CKP is computed directly from the actual 
magnitudes and time-instants of samples. It does not require 
resampling, slotting nor folding schemes as other methods. 
The CKP metric is used as the main part of a fully-automated 
pipeline for period detection and estimation in astronomical 
time series. The CKP metric is used as test statistic to estimate 
the confidence level of the period detection. 

Results on a subset of the MACHO survey shows that 
the CKP metric clearly outperforms the LS and AoV peri- 
odograms in the period detection of light curves. Moreover, the 
CKP method clearly outperforms slotted correntropy, slotted 
correlation, LS -periodogram, AoV and SLLK methods on the 
period estimation of periodic light curves. This is because 
the CKP method incorporates higher-order moments when 
computing the periodogram, and it emphasizes the trial period 
and its multiples. Future work will focus on enhancing the 
spurious rejection criterion, developing and adaptive kernel 
size adjusting rule, and discriminating quasi-periodic and 
multi-periodic light curves. 
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